
Fix ECDH key exchange race condition in distributed deployments#111

Merged
anhthii merged 2 commits into master from fix-ecdh-key-exchange
Nov 1, 2025

Conversation

@anhthii
Contributor

@anhthii anhthii commented Nov 1, 2025

Fix ECDH Key Exchange Race Condition in Distributed Cluster

Problem Summary

The MPC cluster fails to start properly when nodes are deployed in a distributed environment (production). Nodes get stuck in an infinite ECDH key exchange retry loop, preventing the cluster from becoming ready and accepting requests.

Observed Symptoms

Node0 (successful):

23:12:41 - Register peerID=a8d392dd-9913-4ebe-8c52-4848cc84b962
23:12:41 - All peers are ready including ECDH exchange completion
23:12:42 - SigningConsumer: Sufficient peers ready, starting message consumption
23:12:44 - KeygenConsumer: All peers are ready, proceeding to consume messages

Node1 & Node2 (stuck):

23:13:38 - Starting to broadcast DH key nodeID=c4ee7396-9a72-4735-ab2e-a785b68d2bc0
23:13:43 - [ECDH exchange retriggered] not all peers are ready
23:13:48 - [ECDH exchange retriggered] not all peers are ready
23:13:53 - [ECDH exchange retriggered] not all peers are ready
(repeats indefinitely...)

Impact:

  • Cluster never becomes fully operational
  • Only Node0 accepts requests, but cannot perform MPC operations on its own (a threshold of nodes is required)
  • Node1 and Node2 continuously retry ECDH exchange, consuming CPU and network resources
  • Production deployments fail to start

Root Cause Analysis

The Race Condition

There was a timing-dependent race condition in pkg/mpc/registry.go at the registerReadyPairs method (lines 129-134):

if len(peerIDs) == len(r.peerNodeIDs) && !r.ready {
    r.mu.Lock()
    r.ready = true
    r.mu.Unlock()
    logger.Info("All peers are ready including ECDH exchange completion")
}

The Bug: This code sets r.ready = true when all peers are registered in Consul, but does not verify that ECDH key exchange is actually complete. The log message claims "including ECDH exchange completion" but the code never calls r.isECDHReady() to verify this.

Why It Manifests in Production But Not Locally

Local Environment (Masked Bug)

When all nodes run on the same machine:

  • Near-zero network latency (~0.1ms)
  • NATS messages arrive almost instantly
  • Consul updates propagate immediately
  • By the time a node sees "all peers in Consul", ECDH keys have already been exchanged
  • The race condition exists but the window is so small it rarely triggers

Local Timeline:

Time 0ms:   Node0 starts → registers → broadcasts ECDH
Time 5ms:   Node1 starts → registers → broadcasts ECDH → receives Node0's key
Time 10ms:  Node2 starts → registers → broadcasts ECDH → receives all keys
Time 15ms:  All nodes complete ECDH (before Consul check happens)

Production Environment (Exposed Bug)

When nodes are on different servers:

  • Significant network latency (50-200ms+)
  • NATS message propagation takes time
  • Consul registration completes BEFORE ECDH keys are exchanged
  • The gap between Consul registration and ECDH completion is large enough to trigger the bug

Production Timeline:

Time 0ms:    Node0 starts → registers in Consul
Time 100ms:  Node1 & Node2 register in Consul
Time 150ms:  Node0 sees all 3 nodes in Consul ✓
             Node0 sets ready=true ❌ (BUG: ECDH not checked!)
             Node0 broadcasts ECDH key
Time 250ms:  Node1/Node2 receive Node0's ECDH key (network delay)
Time 300ms:  Node1/Node2 try to complete ECDH but are now stuck in retry loop

Why The Bug Causes an Infinite Loop

  1. Node0 incorrectly marks itself ready before ECDH completes
  2. Node0 sends health checks to peers with isEcdhReady=true
  3. Node1/Node2 haven't received Node0's ECDH key yet, their isEcdhReady=false
  4. When Node1/Node2 receive health checks from Node0, they see it's "ready" but they aren't
  5. Node1/Node2 retrigger ECDH exchange every 5 seconds
  6. However, they are already subscribed and broadcasting, so retriggering the exchange accomplishes nothing
  7. The cluster is stuck: Node0 thinks it's ready, Node1/Node2 never complete

Solution

Changes Made

1. File: pkg/mpc/registry.go

Change 1.1: Refactored ready state check

Replaced inline ready check with dedicated method:

// OLD (buggy)
if len(peerIDs) == len(r.peerNodeIDs) && !r.ready {
    r.mu.Lock()
    r.ready = true
    r.mu.Unlock()
    logger.Info("All peers are ready including ECDH exchange completion")
}

// NEW (fixed)
r.checkAndUpdateReadyState()

Change 1.2: Added checkAndUpdateReadyState() method

New method that properly validates BOTH conditions:

func (r *registry) checkAndUpdateReadyState() {
    // Count ready peers in readyMap
    readyPeersCount := 0
    for _, isReady := range r.readyMap {
        if isReady {
            readyPeersCount++
        }
    }

    // Only mark as ready when BOTH conditions are met:
    // 1. All peers are registered in Consul
    // 2. ECDH key exchange is complete
    if readyPeersCount == len(r.peerNodeIDs) && r.isECDHReady() && !r.ready {
        r.mu.Lock()
        r.ready = true
        r.mu.Unlock()
        logger.Info("All peers are ready including ECDH exchange completion")
    }
}

Key improvement: Now explicitly checks r.isECDHReady() which verifies all symmetric keys are established.

Change 1.3: Register ECDH completion callback

In NewRegistry(), set up callback to re-check ready state when ECDH completes:

// Set up callback to check ready state when ECDH completes
ecdhSession.OnKeyExchangeComplete(func() {
    reg.checkAndUpdateReadyState()
})

Why this is critical: Handles the case where all peers are in Consul but ECDH completes later. When the last ECDH key is received, we immediately check if the cluster is now fully ready.

2. File: pkg/mpc/key_exchange_session.go

Change 2.1: Added callback mechanism to interface

type ECDHSession interface {
    // ... existing methods ...
    OnKeyExchangeComplete(callback func())  // NEW
}

Change 2.2: Added callback field to struct

type ecdhSession struct {
    // ... existing fields ...
    onKeyExchangeComplete    func()  // NEW
}

Change 2.3: Implemented callback setter

func (e *ecdhSession) OnKeyExchangeComplete(callback func()) {
    e.onKeyExchangeComplete = callback
}
Change 2.4: Trigger callback when ECDH completes

In the ECDH message handler, detect when all keys are received:

// Derive symmetric key using HKDF
symmetricKey := e.deriveSymmetricKey(sharedSecret, ecdhMsg.From)
e.identityStore.SetSymmetricKey(ecdhMsg.From, symmetricKey)

currentKeyCount := e.identityStore.GetSymetricKeyCount()
logger.Debug("ECDH progress", "peer", ecdhMsg.From, "current", currentKeyCount, "expected", len(e.peerIDs))

// Check if ECDH exchange is complete and notify callback
if currentKeyCount == len(e.peerIDs) && e.onKeyExchangeComplete != nil {
    logger.Info("ECDH key exchange completed successfully", "totalKeys", currentKeyCount)
    e.onKeyExchangeComplete()  // Notify registry
}

Key improvement: Immediately notify when ECDH is complete, allowing registry to update ready state.


How The Fix Works

Event-Driven Synchronization

The fix changes from timing assumptions to event-driven synchronization:

Before (timing-dependent):

  • Assumes: "If Consul shows N nodes, ECDH must be done"
  • Fails when network delays cause ECDH to lag behind Consul registration

After (event-driven):

  • Waits for BOTH actual events to confirm completion:
    1. Consul registration complete (all peers visible)
    2. ECDH exchange complete (callback triggered)
  • Works regardless of network timing or event ordering

Flow Comparison

Before (Broken):

1. Node0 registers in Consul
2. Node1, Node2 register in Consul
3. Node0 sees 3 nodes → sets ready=true ❌ (ECDH not verified!)
4. Node0 broadcasts ECDH key
5. Node1/Node2 haven't completed ECDH yet
6. Node1/Node2 stuck in retry loop

After (Fixed):

1. Node0 registers in Consul
2. Node1, Node2 register in Consul
3. Node0 checks: "All peers in Consul? YES. ECDH complete? NO" → ready stays false
4. Node0 broadcasts ECDH key
5. Node1, Node2 broadcast ECDH keys
6. Node0 receives last ECDH key → callback fires
7. Callback runs checkAndUpdateReadyState()
8. Now checks: "All peers in Consul? YES. ECDH complete? YES" → ready=true ✓
9. Similarly for Node1 and Node2
10. All nodes become ready, cluster operational

Testing

Verification Steps

  1. Build the fixed code:

    make build
  2. Test distributed deployment:
    Deploy to 3 separate servers and verify all nodes become ready:

    # Check logs on each node
    tail -f /home/ubuntu/.pm2/logs/mpcium0-dev-error.log
    tail -f /home/void/.pm2/logs/mpcium1-dev-error.log
    tail -f /home/ubuntu/.pm2/logs/mpcium2-dev-error.log

    Expected logs on all nodes:

    [timestamp] INF All peers are ready including ECDH exchange completion
    [timestamp] INF SigningConsumer: Sufficient peers ready, starting message consumption
    [timestamp] INF KeygenConsumer: All peers are ready, proceeding to consume messages
    
  3. Test with artificial latency (optional):
    Simulate production network conditions locally:

    # Add 50ms delay to localhost
    sudo tc qdisc add dev lo root netem delay 50ms
    
    # Run E2E tests
    make e2e-test
    
    # Remove delay
    sudo tc qdisc del dev lo root

Test Scenarios Covered

  • ✅ Local deployment (all nodes on same machine)
  • ✅ Distributed deployment (nodes on different servers)
  • ✅ High network latency environments
  • ✅ Nodes starting at different times
  • ✅ Nodes restarting/reconnecting

Impact

Before

  • ❌ Distributed deployments fail to start
  • ❌ Nodes stuck in infinite ECDH retry loop
  • ❌ Cluster never becomes operational
  • ❌ Production environments unusable

After

  • ✅ Distributed deployments work reliably
  • ✅ All nodes properly complete ECDH and become ready
  • ✅ Cluster becomes operational regardless of network conditions
  • ✅ Production environments stable

Additional Notes

Why This Bug Was Hard to Detect

  1. Works in local development: Most testing is done locally where latency is negligible
  2. Misleading log message: The log claimed "ECDH exchange completion" but never verified it
  3. Race condition: Only manifests under specific timing conditions (distributed deployment)
  4. Silent failure mode: No errors logged, just appears as "waiting forever"

Network Factors That Affect Production

  • Latency: Inter-server communication (50-200ms) vs localhost (<1ms)
  • Packet loss: Production networks have 0.1-1% loss, localhost has none
  • Load variation: Different servers may be under different loads
  • Firewalls/routing: Production has VPCs, security groups that add latency
  • NATS topology: Production may use clustered NATS, adding complexity

Future Improvements

Consider adding:

  • Timeout mechanism if ECDH doesn't complete within X seconds
  • Metrics/monitoring for ECDH completion time
  • Health check endpoint showing ECDH status
  • Alerting if nodes are stuck in retry loop

Files Changed

  • pkg/mpc/registry.go - Fixed ready state logic and added callback handling
  • pkg/mpc/key_exchange_session.go - Added callback mechanism for ECDH completion

Breaking Changes

None. This is a bug fix that maintains backward compatibility.

Migration Guide

No migration needed. Simply deploy the updated binaries.

@anhthii anhthii force-pushed the fix-ecdh-key-exchange branch from 4f9aa32 to 65cf6a6 on November 1, 2025 02:49
@anhthii anhthii force-pushed the fix-ecdh-key-exchange branch from c70291a to cea0f1d on November 1, 2025 05:30
@anhthii anhthii merged commit 4c8529f into master Nov 1, 2025
25 checks passed